Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arp: optimize interface name resolution #3133

Merged
merged 1 commit into from
Jan 10, 2025

Conversation

dswarbrick
Copy link
Contributor

Go net package's net.InterfaceByIndex fetches the entire interface table behind the scenes, despite us specifying the interface index that we are interested in. This results in an exponential amount of rtnetlink traffic (and thus recvfrom syscalls) as the number of interfaces in a host increases.

Fixes: #3075

Copy link
Member

@discordianfish discordianfish left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find! Thanks!

github.com/jsimonetti/rtnetlink provides a high level rtnl wrapper
around the lower level rtnetlink functions, which essentially does all
that we need. The rtnl.Conn.Neighbors uses an internal cache for
resolving interface indexes to names, so it makes at most one rtnetlink
call per interface to resolve the name.

Using this high level wrapper hugely simplifies our code and makes it
easier to understand and maintain.

Fixes: prometheus#3075

Signed-off-by: Daniel Swarbrick <[email protected]>
@maxime1907
Copy link

Any news regarding this PR?

@juliantaylor
Copy link

juliantaylor commented Nov 27, 2024

the issue being fixed is pretty problematic in e.g. calico based kubernetes clusters which have a lot of devices:

ts=2024-11-27T13:10:40.581Z caller=collector.go:173 level=debug msg="collector succeeded" name=arp duration_seconds=14.24509357

workaround until this is merged and released is to switch back to gathering via procfs via

--no-collector.arp.netlink

(limiting devices via --collector.arp.device-exclude does not help)

@dswarbrick
Copy link
Contributor Author

@SuperQ Can we get this merged before the next release?

@discordianfish
Copy link
Member

@SuperQ Yeah PTAL

@dswarbrick
Copy link
Contributor Author

I plan to backport this as a patch in the Debian package prometheus-node-exporter 1.8.2, since the first trixie freeze date is probably not far away. It would still be nice to see this merged upstream however, since it fixes a fairly important bug.

@SuperQ SuperQ merged commit acdd9b8 into prometheus:master Jan 10, 2025
7 checks passed
@dswarbrick dswarbrick deleted the arp-rtnetlink branch January 13, 2025 09:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug: collector.arp.netlink is super slow to retrieve metrics
5 participants